How to Beat the Adaptive Multi-Armed Bandit

Authors

  • Varsha Dani
  • Thomas P. Hayes
Abstract

The multi-armed bandit is a concise model for the problem of iterated decision-making under uncertainty. In each round, a gambler must pull one of K arms of a slot machine, without any foreknowledge of their payouts, except that they are uniformly bounded. A standard objective is to minimize the gambler's regret, defined as the largest payout which would have been achieved by any fixed arm, in hindsight, minus the gambler's total payout. Note that the gambler is only told the payout for the arm actually chosen, not for the unchosen arms. Almost all previous work on this problem assumed the payouts to be non-adaptive, in the sense that the distribution of the payout of arm j in round i is completely independent of the choices made by the gambler on rounds 1, …, i−1. In the more general model of adaptive payouts, the payouts in round i may depend arbitrarily on the history of past choices made by the algorithm. We present a new algorithm for this problem, and prove nearly optimal guarantees for the regret against both non-adaptive and adaptive adversaries. After T rounds, our algorithm has regret O(√T) with high probability (the tail probability decays exponentially). This dependence on T is best possible, and matches that of the full-information version of the problem, in which the gambler is told the payouts for all K arms after each round. Previously, even for non-adaptive payouts, the best high-probability bounds known were O(T^{2/3}), due to Auer, Cesa-Bianchi, Freund and Schapire [1]. For non-adaptive payouts, they also proved an O(√T) bound on expected regret. We describe an adaptive payout scheme for which the expected regret of their algorithm is Ω(T^{2/3}).
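The bandit protocol and the regret objective from the abstract can be sketched in a few lines. This is only an illustration of the model, not the authors' algorithm: the uniform-random gambler below is a placeholder policy, and the payout matrix is generated up front (i.e., non-adaptively) for simplicity.

```python
import random

def play_bandit(payouts, choose):
    """Run the bandit game and return regret vs. the best fixed arm.

    payouts: T x K matrix of per-round arm payouts, uniformly bounded in [0, 1].
    choose:  policy mapping the history of (arm, payout) pairs to an arm index.
    """
    history, total = [], 0.0
    for row in payouts:
        arm = choose(history)
        total += row[arm]
        # Only the pulled arm's payout is revealed to the gambler.
        history.append((arm, row[arm]))
    K = len(payouts[0])
    # Largest payout any single fixed arm would have earned, in hindsight.
    best_fixed = max(sum(row[j] for row in payouts) for j in range(K))
    return best_fixed - total

random.seed(0)
T, K = 1000, 3
payouts = [[random.random() for _ in range(K)] for _ in range(T)]
# Placeholder policy: pull a uniformly random arm, ignoring the history.
regret = play_bandit(payouts, lambda history: random.randrange(K))
```

An adaptive adversary would instead generate each round's payout row as a function of the gambler's past choices (the `history` seen so far), which is exactly the setting the paper's algorithm handles with O(√T) regret.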


Similar articles

The Blinded Bandit: Learning with Adaptive Feedback

We study an online learning setting where the player is temporarily deprived of feedback each time it switches to a different action. Such a model of adaptive feedback naturally occurs in scenarios where the environment reacts to the player's actions and requires some time to recover and stabilize after the algorithm switches actions. This motivates a variant of the multi-armed bandit problem, wh...


Multi armed bandit problem: some insights

Multi-armed bandit problems have been widely studied in the context of sequential analysis. The application areas include clinical trials, adaptive filtering, online advertising, etc. The problem can also be characterized as policy selection that maximizes a gambler's reward when there are multiple slot machines generating the rewards. It is under this framework that we describe the model and de...


On Finding the Largest Mean Among Many

Sampling from distributions to find the one with the largest mean arises in a broad range of applications, and it can be mathematically modeled as a multi-armed bandit problem in which each distribution is associated with an arm. This paper studies the sample complexity of identifying the best arm (largest mean) in a multi-armed bandit problem. Motivated by large-scale applications, we are espe...


Multi-Armed Bandit Policies for Reputation Systems

The robustness of reputation systems against manipulation has been widely studied. However, studies of how to use the reputation values computed by those systems are rare. In this paper, we draw an analogy between reputation systems and multi-armed bandit problems. We investigate how to use multi-armed bandit selection policies in order to increase the robustness of reputation systems ...


Mistake Bounds on Noise-Free Multi-Armed Bandit Game

We study the {0, 1}-loss version of adaptive adversarial multi-armed bandit problems with α(≥ 1) lossless arms. For the problem, we show a tight bound K − α − Θ(1/T ) on the minimax expected number of mistakes (1-losses), where K is the number of arms and T is the number of rounds.



Journal:
  • CoRR

Volume: abs/cs/0602053  Issue:

Pages:  -

Publication date: 2005